Engineering posts about High Availability

Curated summaries and key learnings for engineers working with High Availability.

Rethinking Distributed Systems for Serverless Performance and Reliability

The article explores the evolution of serverless compute for Apache Spark, addressing long-standing architectural challenges that have hindered performance and reliability. It emphasizes the need for...

Slack

15m

From SSH to REST: A Security-Driven Modernization of Slack’s EMR Data Pipelines

The article outlines Slack's transition from a legacy SSH-based architecture to a modern REST-based job submission system for its data pipelines. Initially, the reliance on SSH created significant...

Airbnb

10m

Building a fault-tolerant metrics storage system at Airbnb

The article details Airbnb's development of a high-throughput metrics storage system capable of ingesting 50 million samples per second and managing 2.5 petabytes of data. It outlines the challenges...

Cloudflare

12m

Rearchitecting the Workflows control plane for the agentic era

The article discusses the rearchitecting of the Workflows control plane to accommodate a shift towards agent-triggered workflows, necessitated by the increasing demand for durable execution engines...

Salesforce

Building a Distributed Persistent Queue That Scaled AI Workloads 5x Under LLM Rate Limits

The article discusses the engineering of a distributed persistent queue that orchestrates AI workloads and human workflows within strict infrastructure limits. It highlights the challenges of scaling...

Databricks

Zero-Downtime Patching in Lakebase Part 1: Prewarming

The article discusses the challenges associated with planned maintenance in database systems, particularly focusing on the performance degradation caused by cold restarts. It introduces Lakebase's...

Databricks

Multi-Cloud Challenges, Intelligent Load Balancing, and AI-Powered Workflows: Databricks at SRECon 2026

The article highlights Databricks' advancements in infrastructure reliability and efficiency as presented at SRECon 2026. It delves into the challenges of multi-cloud operations, particularly...

Atlassian

13m

Scaling Jira cloud Migrations, One Bottleneck at a Time

The article chronicles the Jira Migrations team's journey in scaling their migration platform from handling 20,000 to 50,000 Monthly Paid Enabled Users (PEUs). It discusses the transition from an...

Salesforce

How Data 360 Optimized Kubernetes Scheduling Architecture, Delivering 13% Cost Savings

The article discusses how the Data 360 Compute Fabric team at Salesforce optimized Kubernetes scheduling to enhance resource efficiency and reduce costs. By evolving the default kube-scheduler...

GitHub

How we rebuilt the search architecture for high availability in GitHub Enterprise Server

The article discusses the architectural improvements made to the search functionality in GitHub Enterprise Server to enhance high availability (HA). It highlights the transition from a clustered...

Databricks

Best Practices for High QPS Model Serving on Databricks

The article outlines best practices for achieving high queries per second (QPS) performance in model serving on Databricks. It emphasizes the importance of low latency and high throughput for...

Airbnb

My Journey to Airbnb — Anna Sulkina

Anna Sulkina's journey to Airbnb highlights her extensive experience in engineering, particularly in application and cloud infrastructure. She transitioned from hardware diagnostics to software...

Meta (Facebook)

Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters

The article discusses the implementation of backend aggregation (BAG) in Meta's Prometheus AI clusters, highlighting its role in interconnecting thousands of GPUs across multiple data centers. BAG...

Atlassian

23m

How We Unlocked Performance at Scale with Jira Platform

The article discusses the significant rearchitecture of the Jira Cloud platform, transitioning from a single-tenant database to a cloud-native, multi-tenant architecture designed for scalability,...

Databricks

16m

Open Sourcing Dicer: Databricks’ Auto-Sharder

The article announces the open sourcing of Dicer, Databricks' foundational auto-sharding system designed to enhance the performance and reliability of sharded services. Dicer addresses the...

Cloudflare

14m

How Workers powers our internal maintenance scheduling pipeline

Welcoming Stately Cloud to Databricks: Investing in the Foundation for Scalable AI Applications

The Communication Complexity of Distributed Estimation